Abstract

This paper gives specific divergence examples of value-iteration for several
major Reinforcement Learning and Adaptive Dynamic Programming algorithms,
when using a function approximator for the value function. These divergence
examples differ from previous divergence examples in the literature, in that
they are applicable for a greedy policy, i.e. in a "value iteration" scenario.
Perhaps surprisingly, with a greedy policy, it is also possible to get
divergence for the algorithms TD(1) and Sarsa(1). In addition to these
divergences, we also achieve divergence for the Adaptive Dynamic Programming
algorithms HDP, DHP and GDHP.

1 Introduction

Adaptive Dynamic Programming (ADP) (Wang et al., 2009) and Reinforcement
Learning (RL) (Sutton & Barto, 1998) are similar fields of study that aim to
make an agent learn actions that maximise a long-term reward function. These
algorithms often rely on learning a "value function" that is defined in
Bellman's Principle of Optimality (Bellman, 1957). When an algorithm attempts
to learn this value function by a general smooth function approximator, while
the agent is being controlled by a "greedy policy" on that approximated value
function, then ensuring convergence of the learning algorithm is difficult.

It has so far been an open question as to whether divergence can occur
under these conditions and for which algorithms. In this paper we present a
simple artificial test problem which we use to make many RL and ADP
algorithms diverge with a greedy policy. The value function learning
algorithms that we consider are Sarsa(λ) (Rummery & Niranjan, 1994), TD(λ)
(Sutton, 1988), and the ADP algorithms Heuristic Dynamic Programming (HDP),
Dual Heuristic Dynamic Programming (DHP), Globalized Dual Heuristic Dynamic
Programming (GDHP) (Werbos, 1992; Prokhorov & Wunsch, 1997; Ferrari & Stengel,
2004) and Value-Gradient Learning (VGL(λ)) (Fairbank & Alonso, 2011). We
prove divergence of all of these algorithms (including VGL(0), VGL(1),
Sarsa(0), Sarsa(1), TD(0) and TD(1)), all when operating with greedy policies,
i.e. in a "value-iteration" setting.



Some of these algorithms have convergence proofs when a fixed policy is
used. For example, TD(λ) is proven to converge when λ = 1, since it is then
(and only then) true gradient descent on an error function (Sutton, 1988).
Also, for 0 ≤ λ ≤ 1, it is proven to converge by Tsitsiklis & Van Roy (1996a)
when the approximate value function is linear in its weight vector and
learning is "on-policy". However, these convergence proofs do not apply to
the greedy policy that we consider in this paper.



Ferrari & Stengel (2004) show the ADP processes will converge to optimal
behaviour if the value function could be perfectly learned over all of state
space at each iteration. However, in reality we must work with a function
approximator for the value function with finite capabilities, so this
assumption is not valid. Working with a general quadratic function
approximator, Werbos (1998, sections 7.7-7.8) proves the general instability
of DHP and GDHP. This analysis was for a fixed policy, so with a greedy policy
convergence would presumably be even less likely. This paper confirms this.

A key insight into the difficulty of understanding convergence with a greedy
policy is shown by Fairbank & Alonso (2011, Lemma 7): the dependency of a
greedy action on the approximated value function is primarily through the
value-gradient, i.e. the gradient of the value function with respect to the
state vector. We use a value-gradient analysis in this paper to understand the
divergence of all of the algorithms being tested. Fairbank & Alonso (2011) and
Fairbank (2008) recently defined a value-function learning algorithm that is
proven to converge under certain smoothness conditions, using a greedy policy
and an arbitrary smooth approximated value function, so this contrasts greatly
with the diverging algorithm examples we give here.

In the rest of this introduction (sections 1.1-1.4), we state the general
RL/ADP problem and give the necessary function definitions. In section 2 we
give definitions of the algorithms that we are testing.

The approach we take to achieve divergence is to define a problem that is
simple enough to analyse algebraically, but flexible enough to provide a
divergence example (sections 3, 3.1). We then analyse a trajectory for this
problem (sections 3.2-3.4), so that we can write the VGL(λ) weight update as a
single dynamic system and hence examine what choice of parameters could be
made to force this dynamic system to diverge (section 4). The VGL(λ) weight
update is easier to analyse than the TD(λ) one since, as mentioned above, the
greedy policy depends on the value-gradient; so in section 5 we just use the
same learning parameters that caused divergence for VGL(λ) and find
empirically that they cause the other algorithms to diverge too.

Finally, in section 6 we give conclusions and discuss both the difficulty of
ensuring value-iteration convergence and its potential advantages compared to
policy-iteration.

1.1 RL and ADP Problem Definition and Notation 

The typical RL/ADP scenario is an agent wandering around in an environment,
such that at time $t$ it has state vector $x_t$. At each time $t$ the agent
chooses an action $a_t$ which takes it to the next state according to the
environment's model function $x_{t+1} = f(x_t, a_t)$, and gives it an
immediate reward, $r_t$, given by the function $r_t = r(x_t, a_t)$. In general
these model functions $f$ and $r$ can be stochastic functions. The agent keeps
moving, forming a trajectory of states $(x_0, x_1, \ldots)$, which terminates
if and when a designated terminal state is reached. In RL/ADP, we aim to find
a policy function, $\pi(x)$, that calculates which action $a = \pi(x)$ to take
for any given state $x$. The objective of RL/ADP is to find a policy such that
the expectation of the total discounted reward,
$\langle \sum_t \gamma^t r_t \rangle$, is maximised for any trajectory. Here
$\gamma \in [0, 1]$ is a constant discount factor that specifies the
importance of long term rewards over short term ones.
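
As a small illustration of the objective above, the following sketch computes
the total discounted reward of a single finite trajectory. The function name
`discounted_return` and the example numbers are illustrative assumptions and
do not come from the original paper; the expectation over trajectories is not
modelled here.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one finite trajectory."""
    total = 0.0
    # Accumulate backwards so that each step applies one factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # Three rewards of 1 with gamma = 0.5 give 1 + 0.5 + 0.25.
    print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1.75
```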

There are only minor differences between the ADP and RL learning methods
that we know of; one difference is that RL methods commonly place more
emphasis on model-free learning than ADP methods do, whereas ADP methods
often assume the model functions are already known and can therefore be made
use of during learning.

1.2 Approximate Value Function (Critic) and its Gradient 

We define $V(x, w)$ to be the real-valued scalar output of a smooth function
approximator with weight vector $w$ and input vector $x$. This is the
"approximate value function", or "critic". We define $G(x, w)$, the
"approximate value gradient" or "critic gradient", to be
$G(x, w) \equiv \partial V(x, w) / \partial x$.

Here and throughout this paper, a convention is used that all defined vector
quantities are columns, whether they are coordinates, or derivatives with
respect to coordinates. So, for example, $G$, $\partial V / \partial x$ and
$\partial V / \partial w$ are all columns.

1.3 Greedy Policy 

The greedy policy is the function that always chooses actions as follows: 

$$a = \arg\max_a \left( Q(x, a, w) \right) \quad \forall x \tag{1}$$

where we define the approximate Q Value function (Watkins, 1989) as

$$Q(x, a, w) = r(x, a) + \gamma V(f(x, a), w) \tag{2}$$
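
Equations 1 and 2 can be sketched in code as follows. This is a minimal
illustration under stated assumptions: the paper's actions are continuous,
whereas this sketch searches a finite list of candidate actions, and the
names `q_value`, `greedy_action` and the toy model in the usage example are
hypothetical.

```python
def q_value(x, a, w, f, r, V, gamma):
    """Approximate Q value of eq. 2: r(x, a) + gamma * V(f(x, a), w)."""
    return r(x, a) + gamma * V(f(x, a), w)


def greedy_action(x, w, f, r, V, gamma, candidate_actions):
    """Pick the candidate action with the largest approximate Q value (eq. 1)."""
    return max(candidate_actions,
               key=lambda a: q_value(x, a, w, f, r, V, gamma))


if __name__ == "__main__":
    # Hypothetical toy model: move by a, pay a**2, critic V(x, w) = -w * x**2.
    f = lambda x, a: x + a
    r = lambda x, a: -a ** 2
    V = lambda x, w: -w * x ** 2
    print(greedy_action(1.0, 1.0, f, r, V, 1.0, [-1.0, -0.5, 0.0]))  # -0.5
```

In the test problem of section 3 the maximisation is instead done
analytically, by setting the derivative of $Q$ with respect to the action to
zero.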

1.4 Trajectory Shorthand Notation 

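Throughout this paper, all subscripted indices are what we call trajectory
shorthand notation. These refer to the time step of a trajectory and provide
corresponding arguments $x_t$ and $a_t$ where appropriate; so that, for
example, $V_{t+1} \equiv V(x_{t+1}, w)$, and a parenthesised derivative
carrying a subscript $t$ is that derivative evaluated at time step $t$, e.g.
$\left( \frac{\partial V}{\partial w} \right)_t$ is shorthand for
$\left. \frac{\partial V}{\partial w} \right|_{(x_t, w)}$.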



2 Learning Algorithms and Definitions 

2.1 TD(λ) Learning

The TD(λ) algorithm (Sutton, 1988) can be defined in batch mode by the
following weight update applied to an entire trajectory:

$$\Delta w = \alpha \sum_{t \ge 0} \left( \frac{\partial V}{\partial w} \right)_t \left( R^\lambda_t - V_t \right) \tag{3}$$

where $\lambda \in [0, 1]$ and $\alpha > 0$ are fixed constants, and
$R^\lambda_t$ is the (moving) target for this weight update. It is known as
the "λ-Return", as defined by Watkins (1989). For a given trajectory, this can
be written concisely using trajectory shorthand notation by the recursion

$$R^\lambda_t = r_t + \gamma \left( \lambda R^\lambda_{t+1} + (1 - \lambda) V_{t+1} \right) \tag{4}$$

with $R^\lambda_t = 0$ at any terminal state, as proven by Fairbank & Alonso
(2011, Appendix A). This equation introduces the dependency on λ into eq. 3.
Using the λ-Return enables us to write TD(λ) in this very concise way, known
as the "forwards view of TD(λ)" (Sutton & Barto, 1998); however, the
traditional way to implement the algorithm is using "eligibility traces", as
described by Sutton (1988).
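
The recursion in eq. 4 can be evaluated by a single backwards sweep over a
finished trajectory. The sketch below is illustrative only: the function name
`lambda_returns` and the list layout (`next_values[t]` holding $V_{t+1}$) are
assumptions made for this example.

```python
def lambda_returns(rewards, next_values, gamma, lam):
    """Compute R^lambda_t for t = 0..T-1 by the backwards recursion of eq. 4.

    rewards[t] is r_t and next_values[t] is V_{t+1}. The lambda-return at
    the terminal step T is taken to be zero.
    """
    T = len(rewards)
    R = [0.0] * (T + 1)  # R[T] = 0 at the terminal state
    for t in reversed(range(T)):
        R[t] = rewards[t] + gamma * (lam * R[t + 1]
                                     + (1.0 - lam) * next_values[t])
    return R[:T]


if __name__ == "__main__":
    rewards, next_values = [1.0, 2.0], [10.0, 0.0]
    print(lambda_returns(rewards, next_values, 0.5, 0.0))  # [6.0, 2.0]
    print(lambda_returns(rewards, next_values, 0.5, 1.0))  # [2.0, 2.0]
```

With λ = 0 each target bootstraps from the critic after one step, and with
λ = 1 the critic values are ignored and the target is the discounted sum of
the remaining rewards.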



2.2 Sarsa(λ) Algorithm

Sarsa(λ) is an algorithm for control problems that learns to approximate the
$Q(x, a, w)$ function (Rummery & Niranjan, 1994). It is designed for policies
that are dependent on the $Q(x, a, w)$ function (e.g. the greedy policy or a
greedy policy with added stochastic noise), where $Q(x, a, w)$ here is defined
to be the output of a given function approximator.

The Sarsa(λ) algorithm is defined for trajectories where all actions after the
first are found by the given policy; the first action $a_0$ can be arbitrary.
The function-approximator update is defined to be:

$$\Delta w = \alpha \sum_{t \ge 0} \left( \frac{\partial Q}{\partial w} \right)_t \left( Q^\lambda_t - Q_t \right) \tag{5}$$

where $Q^\lambda_t$ is the target for this weight update. This is analogous to
the λ-return, but uses the function approximator $Q$ in place of $V$. We can
define $Q^\lambda_t$ recursively in trajectory shorthand notation by

$$Q^\lambda_t = r_t + \gamma \left( \lambda Q^\lambda_{t+1} + (1 - \lambda) Q_{t+1} \right) \tag{6}$$

with $Q^\lambda_t = 0$ at any terminal state.



2.3 The VGL(λ) Algorithm

To define the VGL(λ) algorithm, throughout this paper we use a convention
that differentiating a column vector function by a column vector causes the
vector in the numerator to become transposed (becoming a row). For example
$\partial f / \partial x$ is a matrix with element $(i, j)$ equal to
$\partial f^j / \partial x^i$. Similarly,
$(\partial G / \partial w)^{ij} = \partial G^j / \partial w^i$, and
$(\partial G / \partial w)_t$ is this matrix evaluated at $(x_t, w)$.

Using this notation and the implied matrix products, all VGL algorithms
can be defined by a weight update of the form:

$$\Delta w = \alpha \sum_{t \ge 0} \left( \frac{\partial G}{\partial w} \right)_t \Omega_t \left( G'_t - G_t \right) \tag{7}$$

where $\alpha$ is a small positive constant; $G_t$ is the approximate value
gradient; and $G'_t$ is the "target value gradient" defined recursively by:

$$G'_t = \left( \frac{Dr}{Dx} \right)_t + \gamma \left( \frac{Df}{Dx} \right)_t \left( \lambda G'_{t+1} + (1 - \lambda) G_{t+1} \right) \tag{8}$$

with $G'_t = 0$ at any terminal state; where $\Omega_t$ is an arbitrary
positive definite matrix of dimension ($\dim x \times \dim x$); and where
$\frac{D}{Dx}$ is shorthand for

$$\frac{D}{Dx} \equiv \frac{\partial}{\partial x} + \frac{\partial \pi}{\partial x} \frac{\partial}{\partial a} \tag{9}$$

and where all of these derivatives are assumed to exist. Equations 7, 8 and 9
define the VGL(λ) algorithm. Fairbank & Alonso (2011) give further details,
and pseudocode for both on-line and batch-mode implementations.

The $\Omega_t$ matrix was introduced by Werbos (1998), and can be chosen
freely by the experimenter, but it is difficult to decide how to do this; so
for most purposes it is just taken to be the identity matrix. However, for the
special choice of $\Omega_t$ given in eq. 10, the algorithm VGL(1) is proven
to converge (Fairbank & Alonso, 2011) when used in conjunction with a greedy
policy, and under certain smoothness assumptions.

2.4 Definition of ADP Algorithms (HDP, DHP and GDHP) 

All of the ADP algorithms we will define here are particularly intended for 
the situation where V is implemented as the output of a neural network, and 
the policy function is implemented as the output of a second neural network. 
However for our divergence examples in this paper we are instead using the 
greedy policy. Excluding this difference, the three ADP algorithms we consider 
here can all be defined in terms of the algorithms defined so far in this paper. 



• The algorithm Heuristic Dynamic Programming (HDP) uses the same 
weight update for its V function as TD(0). 

• The algorithm Dual Heuristic Dynamic Programming (DHP) uses the
same weight update for its $G$ function as VGL(0). The function $G(x, w)$
is generally implemented as the output of a vector function approximator,
i.e. without it having to explicitly be the gradient
$\partial V / \partial x$.

• Globalized Dual Heuristic Programming (GDHP) uses a linear combination
of a weight update by VGL(0) and one by TD(0).

3 Problem Definition For Divergence 

We define the simple RL problem domain and function approximator suitable 
for providing divergence examples for the algorithms being tested. 

First we define an environment with $x \in \mathbb{R}$ and
$a \in \mathbb{R}$, and model functions:

$$f(x_t, t, a_t) = \begin{cases} x_t + a_t & \text{if } t \in \{0, 1\} \\ x_t & \text{if } t = 2 \end{cases} \tag{11a}$$

$$r(x_t, t, a_t) = \begin{cases} -k a_t^2 & \text{if } t \in \{0, 1\} \\ -x_t^2 & \text{if } t = 2 \end{cases} \tag{11b}$$

where $k > 0$ is a constant. Each trajectory is defined to terminate at time
step $t = 3$, so that exactly three rewards are received by the agent (rewards
are given at timings as defined in section 1.1, i.e. with the final reward
$r_2$ being received on transitioning from $t = 2$ to $t = 3$). In these model
function definitions, action $a_2$ has no effect, so the whole trajectory is
parametrised by just $x_0$, $a_0$ and $a_1$, and the total reward for this
trajectory is $-k(a_0^2 + a_1^2) - (x_0 + a_0 + a_1)^2$. These model functions
are dependent on $t$, which is an abuse of notation we have adopted for
brevity, but this could be legitimised by including $t$ in $x$.
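
The model functions of eq. 11 are small enough to write out directly. The
sketch below is an illustration only; the helper name `rollout` and the
argument order are assumptions made for this example. It checks the
total-reward expression quoted above on one arbitrary trajectory.

```python
def f(x, t, a):
    """Model function of eq. 11a: the action moves the state for t in {0, 1}."""
    return x + a if t in (0, 1) else x


def r(x, t, a, k):
    """Reward function of eq. 11b: action cost for t in {0, 1}, state cost at t = 2."""
    return -k * a ** 2 if t in (0, 1) else -x ** 2


def rollout(x0, a0, a1, k):
    """Run the three-step trajectory; the action at t = 2 has no effect."""
    states, rewards = [x0], []
    for t, a in enumerate((a0, a1, 0.0)):
        rewards.append(r(states[-1], t, a, k))
        states.append(f(states[-1], t, a))
    return states, rewards


if __name__ == "__main__":
    x0, a0, a1, k = 0.5, 1.0, -0.25, 2.0
    states, rewards = rollout(x0, a0, a1, k)
    expected = -k * (a0 ** 2 + a1 ** 2) - (x0 + a0 + a1) ** 2
    print(sum(rewards), expected)  # both are -3.6875
```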

3.1 Critic Definition 

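A critic function is defined using a weight vector with just four weights,
$w = (w_1, w_2, w_3, w_4)^T$, as follows:

$$V(x_t, t, w) = \begin{cases} -c_1 x_1^2 + w_1 x_1 + w_3 & \text{if } t = 1 \\ -c_2 x_2^2 + w_2 x_2 + w_4 & \text{if } t = 2 \\ 0 & \text{if } t \in \{0, 3\} \end{cases} \tag{12}$$

where $c_1$ and $c_2$ are real positive constants.

Hence the critic gradient function, $G \equiv \partial V / \partial x$, is
given by:

$$G(x_t, t, w) = \begin{cases} -2 c_t x_t + w_t & \text{if } t \in \{1, 2\} \\ 0 & \text{if } t \in \{0, 3\} \end{cases} \tag{13}$$

We note that this implies

$$\left( \frac{\partial G}{\partial w_k} \right)_t = \begin{cases} 1 & \text{if } t \in \{1, 2\} \text{ and } t = k \\ 0 & \text{otherwise} \end{cases} \tag{14}$$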

3.2 Unrolling a greedy trajectory 

Substituting the model functions (eq. 11) and the critic definition (eq. 12)
into the Q function definition (eq. 2) gives, with $\gamma = 1$,

$$Q(x_t, t, a_t, w) = \begin{cases} -k (a_0)^2 - c_1 (x_0 + a_0)^2 + w_1 (x_0 + a_0) + w_3 & \text{if } t = 0 \\ -k (a_1)^2 - c_2 (x_1 + a_1)^2 + w_2 (x_1 + a_1) + w_4 & \text{if } t = 1 \end{cases}$$

In order to maximise this with respect to $a_t$ and get greedy actions, we
first differentiate to get

$$\begin{aligned} \left( \frac{\partial Q}{\partial a} \right)_t &= -2 k a_t - 2 c_{t+1} (x_t + a_t) + w_{t+1} && \text{for } t \in \{0, 1\} \\ &= -2 a_t (c_{t+1} + k) + w_{t+1} - 2 c_{t+1} x_t && \text{for } t \in \{0, 1\} \end{aligned}$$

Hence the greedy actions are given by

$$a_0 = \frac{w_1 - 2 c_1 x_0}{2 (c_1 + k)} \tag{15}$$

$$a_1 = \frac{w_2 - 2 c_2 x_1}{2 (c_2 + k)} \tag{16}$$

Following these actions along a trajectory starting at $x_0 = 0$, and using
the recursion $x_{t+1} = f(x_t, a_t)$ with the model functions (eq. 11) gives

$$x_1 = a_0 = \frac{w_1}{2 (c_1 + k)} \tag{17}$$

and

$$x_2 = x_1 + a_1 = \frac{w_2 (c_1 + k) + k w_1}{2 (c_2 + k)(c_1 + k)} \tag{18}$$

Substituting $x_1$ (eq. 17) back into the equation for $a_1$ (eq. 16) gives
$a_1$ purely in terms of the weights and constants:¹

$$a_1 = \frac{w_2 (c_1 + k) - c_2 w_1}{2 (c_2 + k)(c_1 + k)} \tag{19}$$

¹ We emphasise that we are doing this step for the divergence analysis, and
that this is not the way that VGL is meant to be implemented in practice.
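
The closed forms for $x_1$, $x_2$ and $a_1$ can be checked numerically by
rolling the greedy actions forward step by step. This is a sketch for
checking the algebra only; the function names and the sample constants are
assumptions chosen for illustration.

```python
import math


def greedy_trajectory(w1, w2, c1, c2, k, x0=0.0):
    """Roll out the greedy actions (eqs. 15-16) through x_{t+1} = x_t + a_t."""
    a0 = (w1 - 2.0 * c1 * x0) / (2.0 * (c1 + k))  # eq. 15
    x1 = x0 + a0
    a1 = (w2 - 2.0 * c2 * x1) / (2.0 * (c2 + k))  # eq. 16
    x2 = x1 + a1
    return a0, a1, x1, x2


def closed_form(w1, w2, c1, c2, k):
    """Closed-form x1, x2 and a1 (eqs. 17-19), valid for x0 = 0."""
    denom = 2.0 * (c2 + k) * (c1 + k)
    x1 = w1 / (2.0 * (c1 + k))
    x2 = (w2 * (c1 + k) + k * w1) / denom
    a1 = (w2 * (c1 + k) - c2 * w1) / denom
    return x1, x2, a1


if __name__ == "__main__":
    params = (0.3, -0.7, 0.5, 0.25, 0.1)  # arbitrary w1, w2, c1, c2, k
    a0, a1, x1, x2 = greedy_trajectory(*params)
    cx1, cx2, ca1 = closed_form(*params)
    print(math.isclose(x1, cx1), math.isclose(x2, cx2), math.isclose(a1, ca1))
```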



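3.3 Evaluation of value-gradients along the greedy trajectory

We can now evaluate the $G$ values by substituting the greedy trajectory's
state vectors (eqs. 17-18) into eq. 13, giving:

$$G_1 = -\frac{c_1 w_1}{c_1 + k} + w_1 = \frac{w_1 k}{c_1 + k} \tag{20}$$

and

$$G_2 = -\frac{w_2 (c_1 + k) c_2 + k w_1 c_2}{(c_2 + k)(c_1 + k)} + w_2 = \frac{w_2 k (c_1 + k) - k w_1 c_2}{(c_2 + k)(c_1 + k)} \tag{21}$$

The greedy actions in equations 15 and 16 both satisfy

$$\left( \frac{\partial \pi}{\partial x} \right)_t = \begin{cases} -\frac{c_{t+1}}{c_{t+1} + k} & \text{for } t \in \{0, 1\} \\ 0 & \text{otherwise} \end{cases} \tag{22}$$

Substituting eqs. 22 and 11 into
$\frac{Df}{Dx} = \frac{\partial f}{\partial x} + \frac{\partial \pi}{\partial x} \frac{\partial f}{\partial a}$
gives

$$\left( \frac{Df}{Dx} \right)_t = \begin{cases} 1 - \frac{c_{t+1}}{c_{t+1} + k} = \frac{k}{c_{t+1} + k} & \text{if } t \in \{0, 1\} \\ 1 & \text{if } t = 2 \end{cases} \tag{23}$$

Similarly, substituting them into
$\frac{Dr}{Dx} = \frac{\partial r}{\partial x} + \frac{\partial \pi}{\partial x} \frac{\partial r}{\partial a}$
gives

$$\left( \frac{Dr}{Dx} \right)_t = \begin{cases} 0 - \frac{c_{t+1}}{c_{t+1} + k} (-2 k a_t) = \frac{2 k c_{t+1} a_t}{c_{t+1} + k} & \text{if } t \in \{0, 1\} \\ -2 x_t & \text{if } t = 2 \end{cases} \tag{24}$$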



3.4 Backwards pass along trajectory 

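We do a backwards pass along the trajectory calculating the target gradients
using eq. 8 with $\gamma = 1$, and starting with $G_3 = 0$ (by eq. 13) and
$G'_3 = 0$ (since $G'_3$ is at a terminal state):

$$\begin{aligned} G'_2 &= \left( \frac{Dr}{Dx} \right)_2 && \text{by eq. 8 and } G'_3 = G_3 = 0 \\ &= -2 x_2 && \text{by eq. 24} \\ &= -\frac{w_2 (c_1 + k) + k w_1}{(c_2 + k)(c_1 + k)} && \text{by eq. 18} \end{aligned} \tag{25}$$

Similarly,

$$\begin{aligned} G'_1 &= \left( \frac{Dr}{Dx} \right)_1 + \left( \frac{Df}{Dx} \right)_1 \left( \lambda G'_2 + (1 - \lambda) G_2 \right) && \text{by eq. 8} \\ &= \frac{2 k c_2 a_1}{c_2 + k} + \frac{k}{c_2 + k} \left( \lambda G'_2 + (1 - \lambda) G_2 \right) && \text{by eqs. 24 and 23} \\ &= \frac{k c_2 \left( w_2 (c_1 + k) - c_2 w_1 \right)}{(c_1 + k)(c_2 + k)^2} + \frac{k}{c_2 + k} \left( -\lambda \frac{w_2 (c_1 + k) + k w_1}{(c_2 + k)(c_1 + k)} + (1 - \lambda) \frac{w_2 k (c_1 + k) - k w_1 c_2}{(c_2 + k)(c_1 + k)} \right) && \text{by eqs. 19, 25 and 21} \\ &= \frac{w_2 k \left( c_2 - \lambda + k (1 - \lambda) \right)}{(c_2 + k)^2} - \frac{w_1 k \left( k \lambda + (c_2)^2 + k (1 - \lambda) c_2 \right)}{(c_1 + k)(c_2 + k)^2} \end{aligned} \tag{26}$$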
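
The backwards pass can be cross-checked numerically against the closed form
of eq. 26. The sketch below is an assumption-laden illustration (function
names and sample constants are chosen for this example): it evaluates the
recursion of eq. 8 with $\gamma = 1$ on the greedy trajectory starting at
$x_0 = 0$ and compares the result with the final line of eq. 26.

```python
def value_gradients(w1, w2, c1, c2, k, lam):
    """Return (G1, G2, G'1, G'2) on the greedy trajectory from x0 = 0, gamma = 1."""
    a0 = w1 / (2.0 * (c1 + k))                    # eq. 15 with x0 = 0
    x1 = a0
    a1 = (w2 - 2.0 * c2 * x1) / (2.0 * (c2 + k))  # eq. 16
    x2 = x1 + a1
    G1 = -2.0 * c1 * x1 + w1                      # eq. 13
    G2 = -2.0 * c2 * x2 + w2
    Gp2 = -2.0 * x2                               # eq. 8 at t = 2, since G'_3 = G_3 = 0
    Dr1 = 2.0 * k * c2 * a1 / (c2 + k)            # eq. 24 at t = 1
    Df1 = k / (c2 + k)                            # eq. 23 at t = 1
    Gp1 = Dr1 + Df1 * (lam * Gp2 + (1.0 - lam) * G2)  # eq. 8 at t = 1
    return G1, G2, Gp1, Gp2


def target_gradient_closed_form(w1, w2, c1, c2, k, lam):
    """Final line of eq. 26 for G'1."""
    term_w2 = w2 * k * (c2 - lam + k * (1.0 - lam)) / (c2 + k) ** 2
    term_w1 = (w1 * k * (k * lam + c2 ** 2 + k * (1.0 - lam) * c2)
               / ((c1 + k) * (c2 + k) ** 2))
    return term_w2 - term_w1


if __name__ == "__main__":
    args = (0.3, -0.7, 0.5, 0.25, 0.1, 0.4)  # arbitrary w1, w2, c1, c2, k, lambda
    _, _, Gp1, _ = value_gradients(*args)
    print(abs(Gp1 - target_gradient_closed_form(*args)) < 1e-12)
```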



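4 Divergence Examples for VGL and DHP Algorithms

We now have the whole trajectory and the terms $G$ and $G'$ written
algebraically, so that we can next analyse the VGL(λ) weight update for
divergence. The VGL(λ) weight update (eq. 7) combined with $\Omega_t = 1$
gives

$$\begin{aligned} \Delta w_i &= \alpha \sum_t \left( \frac{\partial G}{\partial w_i} \right)_t \left( G'_t - G_t \right) \\ &= \alpha \left( G'_i - G_i \right) && \text{(for } i \in \{1, 2\} \text{, by eq. 14)} \end{aligned}$$

$$\Rightarrow \begin{pmatrix} \Delta w_1 \\ \Delta w_2 \end{pmatrix} = \alpha \begin{pmatrix} G'_1 - G_1 \\ G'_2 - G_2 \end{pmatrix} = \alpha A \begin{pmatrix} w_1 \\ w_2 \end{pmatrix} \tag{27}$$

where $A = \begin{pmatrix} A_{00} & A_{01} \\ A_{10} & A_{11} \end{pmatrix}$
is a 2 × 2 matrix with elements which were found by subtracting equations 20
and 21 from equations 26 and 25 respectively, giving

$$A_{00} = -\frac{k \left( k \lambda + (c_2)^2 + k (1 - \lambda) c_2 \right)}{(c_1 + k)(c_2 + k)^2} - \frac{k}{c_1 + k}$$

$$A_{01} = \frac{k \left( c_2 + k - \lambda (k + 1) \right)}{(c_2 + k)^2}$$

$$A_{10} = \frac{k (c_2 - 1)}{(c_2 + k)(c_1 + k)}$$

$$A_{11} = \frac{-1 - k}{c_2 + k}$$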

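Equation 27 is the VGL(λ) weight update written as a single dynamic system
of just two variables, i.e. a shortened weight vector, $w = (w_1, w_2)^T$. To
add further complexity to the system, in order to achieve the desired
divergence, we next define these two weights to be a linear function of two
other weights, $p = (p_1, p_2)^T$, such that $w = Fp$, where $F$ is a 2 × 2
constant real matrix. The VGL(λ) weight update equation can now be
recalculated for these new weights, as follows:

$$\begin{aligned} \Delta p &= \alpha \sum_t \left( \frac{\partial G}{\partial p} \right)_t \left( G'_t - G_t \right) && \text{by eq. 7 and } \Omega_t = 1 \\ &= \alpha \sum_t \frac{\partial w}{\partial p} \left( \frac{\partial G}{\partial w} \right)_t \left( G'_t - G_t \right) && \text{by chain rule} \\ &= \alpha \frac{\partial w}{\partial p} A w && \text{by eq. 27} \\ &= \alpha \left( F^T A F \right) p && \text{by } w = Fp \text{ and } \frac{\partial w}{\partial p} = \frac{\partial (Fp)}{\partial p} = F^T \end{aligned} \tag{28}$$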

Taking $\alpha > 0$ to be sufficiently small, the weight vector $p$ evolves
according to a continuous-time linear dynamic system given by eq. 28, and this
system is stable if and only if the matrix product $F^T A F$ is "stable" (i.e.
if the real part of every eigenvalue of this matrix product is negative).
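
The stability criterion can be illustrated by iterating a linear update of
the form $p \leftarrow p + \alpha M p$ with a small step size. The sketch
below assumes NumPy is available; the two matrices `stable` and `unstable`
are hypothetical examples chosen for illustration (with eigenvalues
$-1 \pm 2i$ and $1 \pm 2i$), not matrices from the paper.

```python
import numpy as np


def iterate_linear_update(M, p0, alpha, steps):
    """Apply p <- p + alpha * M p repeatedly and return the final p."""
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        p = p + alpha * (M @ p)
    return p


if __name__ == "__main__":
    stable = np.array([[-1.0, 2.0], [-2.0, -1.0]])   # eigenvalues -1 +/- 2i
    unstable = np.array([[1.0, 2.0], [-2.0, 1.0]])   # eigenvalues  1 +/- 2i
    p0 = [1.0, 0.0]
    # The norm shrinks when all real parts are negative and grows otherwise.
    print(np.linalg.norm(iterate_linear_update(stable, p0, 0.01, 1000)))
    print(np.linalg.norm(iterate_linear_update(unstable, p0, 0.01, 1000)))
```

The sign of the real parts decides the outcome only when $\alpha$ is small
enough; with a large step size the discrete update can grow even for a matrix
that is stable in the continuous-time sense.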

Choosing $\lambda = 0$, with $c_1 = c_2 = k = 0.01$, gives
$A = \begin{pmatrix} -0.75 & 0.5 \\ -24.75 & -50.5 \end{pmatrix}$. Choosing
$F = \begin{pmatrix} 10 & 1 \\ -1 & -1 \end{pmatrix}$ makes
$F^T A F = \begin{pmatrix} 117 & -38.25 \\ 189 & -27 \end{pmatrix}$, which has
eigenvalues $45 \pm 45.22i$. Since the real parts of these eigenvalues are
positive, eq. 28 will diverge for VGL(0) (i.e. DHP). In an extended analysis,
we found that these parameters also cause VGL(0) to diverge when the
$\Omega_t$ matrices are included according to equation 10.

Since GDHP is a linear combination of DHP, which we have proven to di- 
verge, and TD(0) (which we prove to diverge below), it follows that GDHP can 
diverge with a greedy policy too. 

Also, perhaps surprisingly, it is possible to get instability with VGL(1).
Choosing $c_2 = k = 0.01$, $c_1 = 0.99$ gives
$A = \begin{pmatrix} -0.2625 & -24.75 \\ -0.495 & -50.5 \end{pmatrix}$.
Choosing $F = \begin{pmatrix} -1 & -1 \\ 0.2 & 0.02 \end{pmatrix}$ makes
$F^T A F = \begin{pmatrix} 2.7665 & 0.1295 \\ 4.4954 & 0.2222 \end{pmatrix}$,
which has two positive real eigenvalues. Therefore this VGL(1) system
diverges.
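
Both divergence examples can be checked numerically. The sketch below
assumes NumPy is available; the function name `vgl_matrix` is hypothetical,
and the matrices `F0 = [[10, 1], [-1, -1]]` and `F1 = [[-1, -1], [0.2, 0.02]]`
are taken as assumed values of $F$ that reproduce the eigenvalues quoted in
the text.

```python
import numpy as np


def vgl_matrix(lam, c1, c2, k):
    """Build the 2 x 2 matrix A of eq. 27 from its four element formulas."""
    a00 = (-k * (k * lam + c2 ** 2 + k * (1.0 - lam) * c2)
           / ((c1 + k) * (c2 + k) ** 2) - k / (c1 + k))
    a01 = k * (c2 + k - lam * (k + 1.0)) / (c2 + k) ** 2
    a10 = k * (c2 - 1.0) / ((c2 + k) * (c1 + k))
    a11 = (-1.0 - k) / (c2 + k)
    return np.array([[a00, a01], [a10, a11]])


# VGL(0) example: lambda = 0 and c1 = c2 = k = 0.01.
A0 = vgl_matrix(0.0, 0.01, 0.01, 0.01)
F0 = np.array([[10.0, 1.0], [-1.0, -1.0]])    # assumed value of F
eig0 = np.linalg.eigvals(F0.T @ A0 @ F0)

# VGL(1) example: lambda = 1, c2 = k = 0.01 and c1 = 0.99.
A1 = vgl_matrix(1.0, 0.99, 0.01, 0.01)
F1 = np.array([[-1.0, -1.0], [0.2, 0.02]])    # assumed value of F
eig1 = np.linalg.eigvals(F1.T @ A1 @ F1)

if __name__ == "__main__":
    print(A0, eig0)  # eigenvalues with positive real part
    print(A1, eig1)  # two positive real eigenvalues
```

A positive real part in any eigenvalue of $F^T A F$ is what makes eq. 28
unstable, so printing `eig0` and `eig1` is enough to confirm both examples.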

The divergence result for VGL(1) does not affect the convergence result by
Fairbank & Alonso (2011), which is for VGL(1) but with the special choice of
$\Omega_t$ given by eq. 10. It was not possible to make this algorithm diverge
with the methods of this paper.
5 Divergence results for TD(λ), Sarsa(λ) and HDP

To satisfy the exploration requirement of TD(λ)-based algorithms, we
supplemented the greedy policies (eqs. 15 & 16) with a small amount of
stochastic Gaussian noise with zero mean. (We had to add this noise, since
it is well known that these classic RL algorithms must be supplemented with
some form of exploration. This is the classic "exploration versus
exploitation" dilemma. Without exploration, these algorithms do not converge
to an optimal policy, in general. Specific examples of converging to the wrong
policy without exploration are given by Fairbank (2008, Appendix B).)



To achieve divergence of these algorithms with the noisy greedy policy,
we used exactly the same learning and environment constants as used for the
VGL(0) and VGL(1) divergence experiments. These choices of parameters, with
the stochastic noise added to the greedy policy, made TD(0) and TD(1) diverge
respectively, in empirical tests. Source code for this is provided. Hence HDP
diverges too, since this is equivalent to TD(0) with the given policy.

An insight into why the divergence parameters for VGL were sufficient to
make the TD(λ)-based algorithms diverge too is that TD with stochastic
exploration can be understood to be an approximation to a stochastic version
of VGL(λ), so we would expect a divergence example for VGL to cause divergence
for TD(λ) too.

Without the stochastic noise added to the greedy policy, these examples
would not diverge, but instead converge to a sub-optimal policy, which is also
considered a failure.

5.1 Divergence results for Sarsa(λ)

We next prove divergence for Sarsa(λ) by choosing a function approximator
for $Q$ that makes the Sarsa(λ) weight update equivalent to the TD(λ) weight
update, so that the divergence result for TD(λ) carries over to Sarsa(λ).

Sarsa(λ) is designed to work with an arbitrary function approximator for
$Q(x, a, w)$. We will define our $Q$ function exactly by eq. 2. Rearranging
eq. 6 gives

$$\begin{aligned} \frac{Q^\lambda_t - r_t}{\gamma} &= \lambda Q^\lambda_{t+1} + (1 - \lambda) Q_{t+1} \\ &= \lambda Q^\lambda_{t+1} + (1 - \lambda)(r_{t+1} + \gamma V_{t+2}) && \text{by eq. 2} \\ &= r_{t+1} + \lambda \left( Q^\lambda_{t+1} - r_{t+1} \right) + (1 - \lambda)(\gamma V_{t+2}) \\ &= r_{t+1} + \gamma \left( \lambda \left( \frac{Q^\lambda_{t+1} - r_{t+1}}{\gamma} \right) + (1 - \lambda) V_{t+2} \right) \end{aligned} \tag{29}$$

From this we can see that $\left( \frac{Q^\lambda_t - r_t}{\gamma} \right)$
obeys the same recursion equation as $R^\lambda_{t+1}$, and they have the same
endpoint (since both are zero at a terminal state), from which we can conclude
(e.g. by comparing recursion equations 29 and 4) that

$$Q^\lambda_t = r_t + \gamma R^\lambda_{t+1}$$

Substituting this into the Sarsa(λ) weight update (eq. 5), with eq. 2 and
simplifying gives

$$\begin{aligned} \Delta w &= \alpha \sum_{t \ge 0} \left( \frac{\partial Q}{\partial w} \right)_t \left( r_t + \gamma R^\lambda_{t+1} - (r_t + \gamma V_{t+1}) \right) \\ &= \alpha \gamma^2 \sum_{t > 0} \left( \frac{\partial V}{\partial w} \right)_t \left( R^\lambda_t - V_t \right) \end{aligned}$$

which is identical to TD(λ) but with summation over $t$ now excluding $t = 0$,
and with an extra constant factor, $\gamma^2$. The divergence example we
derived above used $\gamma = 1$, and had no weight update term for $t = 0$, so
uses an identical weight update. Therefore this particular choice of function
approximator for $Q$ and problem definition causes divergence for Sarsa(λ)
(with both λ = 1 and λ = 0).
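
The identity between the Sarsa(λ) target and the shifted λ-return,
$Q^\lambda_t = r_t + \gamma R^\lambda_{t+1}$, can be checked numerically on a
random trajectory. The sketch below is illustrative only; the function name
`max_identity_error` and the random rewards and critic values are assumptions
made for this example. It takes $Q_t = r_t + \gamma V_{t+1}$ as in eq. 2 and
sets $Q$ to zero at the terminal step.

```python
import random


def max_identity_error(T=5, gamma=0.9, lam=0.3, seed=0):
    """Largest |Q^lambda_t - (r_t + gamma * R^lambda_{t+1})| over t = 0..T-1."""
    rng = random.Random(seed)
    r = [rng.uniform(-1.0, 1.0) for _ in range(T)]
    # V[t] is the critic value at step t; V[T] = 0 at the terminal state.
    V = [rng.uniform(-1.0, 1.0) for _ in range(T)] + [0.0]

    # lambda-return recursion (eq. 4), with R[T] = 0 at the terminal state.
    R = [0.0] * (T + 1)
    for t in reversed(range(T)):
        R[t] = r[t] + gamma * (lam * R[t + 1] + (1.0 - lam) * V[t + 1])

    # Q defined through the critic (eq. 2), zero at the terminal step.
    Q = [r[t] + gamma * V[t + 1] for t in range(T)] + [0.0]

    # Sarsa(lambda) target recursion (eq. 6), with Qlam[T] = 0.
    Qlam = [0.0] * (T + 1)
    for t in reversed(range(T)):
        Qlam[t] = r[t] + gamma * (lam * Qlam[t + 1] + (1.0 - lam) * Q[t + 1])

    return max(abs(Qlam[t] - (r[t] + gamma * R[t + 1])) for t in range(T))


if __name__ == "__main__":
    for lam in (0.0, 0.3, 1.0):
        print(lam, max_identity_error(lam=lam) < 1e-12)
```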

6 Conclusions 

We have shown that under a value-iteration scheme, i.e. using a greedy policy,
all of the RL algorithms have been made to diverge, and all but one of the VGL
algorithms have been made to diverge. The algorithm we found that didn't
diverge was VGL(1) with $\Omega_t$ as defined by eq. 10, which is proven to
converge by Fairbank & Alonso (2011) and Fairbank (2008) under these
conditions.



These are new divergence results for TD(0), Sarsa(0), TD(1) and Sarsa(1),
in that previous examples of divergence have only been for TD(0) and for
non-greedy policies (Baird, 1995; Tsitsiklis & Van Roy, 1996b,a). The
divergences we achieved for TD(1) and Sarsa(1) were only possible because of
the use of a greedy policy.

It is hoped that these specific examples of divergence of value-iteration will 
provide a better understanding of how it can happen, and help motivate research 
to understand and prevent it. 

A conclusion of this work is that the diverging algorithms considered cannot 
currently be reliably used for value-iteration, and instead can only be used under 
some form of "policy iteration" if provable convergence is required. However 
there are some distinct advantages of value-iteration over policy-iteration that 
we summarise here: 
Sutton et al. (2000) describe conditions under which policy iteration
provably converges. These conditions are thought to apply only when the
function approximator for $V$ is linear in the same features of the state
vector that the function approximator for the policy uses as input (see
footnote 1 of Sutton et al. (2000)). Also, policy iteration in general has an
inner loop of training the value-function to completion, over the whole of
state space, for the current fixed policy, which is an extremely
computationally intensive process (taking theoretically an infinite time to
complete). This inner loop is combined with an outer loop that, for provable
convergence, must train the policy function at a learning rate that tends to
zero; so policy-iteration is prohibitively computationally expensive in
comparison to value-iteration.